Analysis of the stats of the Software packages in the Bioconductor project
Here we analyse the software packages of Bioconductor. See the home of the analysis, where we already transformed the data. From those stats we are going to analyse the Software category:
library("data.table")
library("ggplot2")
load("stats.RData", verbose = TRUE)
## Loading objects:
## stats
## bioc_packages
## monthsConvert
stats <- stats[Category == "Software", ]
stats
## Package Date Month Year Category
## 1: ABarray 2017-01-01 01:00:00 01 2017 Software
## 2: ABarray 2017-02-01 01:00:00 02 2017 Software
## 3: ABarray 2017-03-01 01:00:00 03 2017 Software
## 4: ABarray 2017-04-01 02:00:00 04 2017 Software
## 5: ABarray 2017-05-01 02:00:00 05 2017 Software
## 6: ABarray 2017-06-01 02:00:00 06 2017 Software
## 7: ABarray 2016-01-01 01:00:00 01 2016 Software
## 8: ABarray 2016-02-01 01:00:00 02 2016 Software
## 9: ABarray 2016-03-01 01:00:00 03 2016 Software
## 10: ABarray 2016-04-01 02:00:00 04 2016 Software
## ---
## 80687: timescape 2017-05-01 02:00:00 05 2017 Software
## 80688: timescape 2017-06-01 02:00:00 06 2017 Software
## 80689: twoddpcr 2017-02-01 01:00:00 02 2017 Software
## 80690: twoddpcr 2017-03-01 01:00:00 03 2017 Software
## 80691: twoddpcr 2017-04-01 02:00:00 04 2017 Software
## 80692: twoddpcr 2017-05-01 02:00:00 05 2017 Software
## 80693: twoddpcr 2017-06-01 02:00:00 06 2017 Software
## 80694: wiggleplotr 2017-04-01 02:00:00 04 2017 Software
## 80695: wiggleplotr 2017-05-01 02:00:00 05 2017 Software
## 80696: wiggleplotr 2017-06-01 02:00:00 06 2017 Software
## Nb_of_distinct_IPs Nb_of_downloads
## 1: 105 153
## 2: 155 229
## 3: 119 216
## 4: 184 272
## 5: 192 282
## 6: 113 150
## 7: 170 289
## 8: 161 278
## 9: 185 367
## 10: 236 398
## ---
## 80687: 78 101
## 80688: 49 73
## 80689: 3 3
## 80690: 9 15
## 80691: 36 50
## 80692: 82 129
## 80693: 56 79
## 80694: 36 50
## 80695: 80 106
## 80696: 47 59
There have been 1686 Software packages in Bioconductor.
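A count like this can be read directly from the table. The following is only a minimal sketch on a made-up table (`toy` and its values are assumptions, not the real data), using `uniqueN()` from data.table:

```r
library(data.table)

# Toy stand-in for the `stats` table (values are made up)
toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC", "pkgC"),
  Nb_of_downloads = c(10, 12, 0, 3, 5, 8)
)

# Number of distinct packages that appear in the table
n_packages <- toy[, uniqueN(Package)]
n_packages
```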
First we explore the number of packages being downloaded by month:
stats <- stats[Nb_of_downloads != 0, ] # We remove rows of packages without any download in that month.
theme_bw <- theme_bw(base_size = 16)
theme <- theme(axis.text.x=element_text(angle = 60, hjust = 1))
scal <- scale_x_datetime(date_breaks = "3 months")
ggplot(stats[, .(Downloads = .N), by = Date], aes(Date, Downloads)) +
geom_bar(stat = "identity") +
theme_bw +
ggtitle("Packages downloaded") +
theme +
scal +
xlab("")
Figure 1: Packages in Bioconductor with downloads
The number of packages being downloaded increases almost exponentially with time, which is partially explained by the incorporation of new packages.
ggplot(stats[, .(Number = sum(Nb_of_downloads)), by = Date], aes(Date, Number)) +
geom_bar(stat = "identity") +
theme_bw +
ggtitle("Downloads") +
scal +
theme +
xlab("")
Figure 2: Downloads of packages
Even though the number of packages increases almost exponentially, the number of downloads grows linearly with time from 2011 onwards. This indicates that each software package must compete with more and more packages to be downloaded.
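One possible way to tell exponential from linear growth, instead of judging the plots by eye, is to compare a straight-line fit on the raw scale with one on the log scale. This is only a sketch on a synthetic series (the vectors `t` and `y` are invented for illustration and are not the Bioconductor data):

```r
# Synthetic monthly series: exponential trend with a small deterministic wobble
t <- 1:60
y <- 10 * exp(0.05 * t) * (1 + 0.02 * sin(t))

# A straight-line fit on the raw scale versus on the log scale
r2_linear <- summary(lm(y ~ t))$r.squared
r2_log    <- summary(lm(log(y) ~ t))$r.squared

c(linear = r2_linear, log = r2_log)
```

A clearly better fit on the log scale would support the exponential reading; similar values on both scales would not settle it.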
pd <- position_dodge(0.1)
ggplot(stats[, .(Number = mean(Nb_of_downloads),
sem = sd(Nb_of_downloads)/sqrt(.N)),
by = Date], aes(Date, Number)) +
geom_errorbar(aes(ymin = Number - sem, ymax = Number + sem),
width = .1, position = pd) +
geom_point() +
geom_line() +
theme_bw +
ggtitle("Downloads") +
ylab("Mean download for a package") +
scal +
theme +
xlab("")
Figure 3: Downloads of packages per package
The error bar indicates the standard error of the mean.
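As a reminder of what the error bars contain, here is a small worked example of the standard error of the mean, computed the same way as in the plotting code (standard deviation divided by the square root of the number of observations). The `downloads` vector is made up:

```r
# Made-up monthly downloads for five packages
downloads <- c(120, 80, 45, 300, 15)

# Standard error of the mean: sample standard deviation over sqrt(n)
sem <- sd(downloads) / sqrt(length(downloads))

# Bounds used for the error bars: mean minus/plus one standard error
c(lower = mean(downloads) - sem, upper = mean(downloads) + sem)
```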
Here we can appreciate that the number of downloads per package hasn’t changed much with time. If anything, there is now more dispersion between the downloads of different packages.
This might be due to an increase in the usage of packages or to new packages bringing more users. We start by finding out how many packages have been introduced in Bioconductor each month.
today <- base::date()
year <- substr(today, 21, 25)
month <- monthsConvert(substr(today, 5, 7))
incorporation <- stats[ , .SD[which.min(Date)], by = Package, .SDcols = "Date"]
histincorporation <- incorporation[, .(Number = .N), by = Date, ]
ggplot(histincorporation, aes(Date, Number)) +
geom_bar(stat="identity") +
theme_bw +
ggtitle("Packages with first download") +
scal +
theme +
xlab("")
Figure 4: New packages
We can see that there were more than 350 packages in Bioconductor before 2009, and since then there are occasional rises to 50 packages with a first download (which would be new packages being added).
ggplot(histincorporation, aes(Date, Number)) +
geom_bar(stat="identity") +
theme_bw +
ggtitle("Packages with first download") +
scal +
theme +
ylim(c(0, 60)) +
ylab("New packages") +
xlab("")
## Warning: Removed 1 rows containing missing values (position_stack).
Figure 5: New packages
Zoom on the packages with a first download after 2009.
We can now observe that each year there are two spikes of packages with a first download; usually these are the packages added for a new release of Bioconductor.
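The idea behind these counts can be sketched on a toy table: take the earliest date of each package and then count packages per date. The table `toy` and its package names are invented for illustration:

```r
library(data.table)

toy <- data.table(
  Package = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  Date = as.POSIXct(c("2016-01-01", "2016-02-01", "2016-03-01",
                      "2016-02-01", "2016-03-01"), tz = "UTC")
)

# First month with a download for each package
first_seen <- toy[, .(Date = Date[which.min(Date)]), by = Package]

# Number of packages first seen in each month
new_per_month <- first_seen[, .(Number = .N), by = Date]
new_per_month
```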
Using a similar procedure we can approximate the packages deprecated and removed each month, although a package could have no downloads and still be included in Bioconductor. In this case we look for the last date a package was downloaded, excluding the current month:
deprecation <- stats[, .SD[which.max(Date)], by = Package, .SDcols = c("Date", "Year", "Month")]
# Drop packages last seen in the current month (`month` and `year` were computed above)
deprecation <- deprecation[!(Month == month & Year == year), ]
histDeprecation <- deprecation[, .(Number = .N), by = Date]
ggplot(histDeprecation, aes(Date, Number)) +
geom_bar(stat = "identity") +
theme_bw +
ggtitle("Packages without downloads") +
scal +
theme +
ylab("Last seen packages") +
xlab("")
Figure 6: Date where a package was last downloaded
Approximates the date when packages were removed from Bioconductor.
Here we can see the packages whose last download was in a certain month, assuming that this means they are deprecated. It can happen that a package is no longer downloaded but is still in the Bioconductor repository; this would be the reason for the spike to 80 packages in the last month.
We further explore how much time passes between the incorporation of a package and its last download.
df <- merge(incorporation, deprecation, by = "Package")
# Transform to years, requesting days explicitly instead of relying on the automatic difftime units
timeBioconductor <- as.numeric(difftime(df$Date.y, df$Date.x, units = "days"))/365
hist(timeBioconductor, main = "Time in Bioconductor", xlab = "Years")
abline(v = mean(timeBioconductor), col = "red")
Figure 7: Time of packages between first and last download
The red line indicates the mean time in Bioconductor
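Subtracting two date-times in R returns a `difftime` whose unit is picked automatically, so a conversion to years is safer when the unit is requested explicitly. A minimal sketch with made-up dates (`first` and `last` are hypothetical):

```r
first <- as.POSIXct(c("2010-04-01", "2012-10-01"), tz = "UTC")
last  <- as.POSIXct(c("2011-04-01", "2016-10-01"), tz = "UTC")

# Ask for days explicitly so the result does not depend on the
# unit that `-` would pick automatically for the difftime
days <- as.numeric(difftime(last, first, units = "days"))

# Approximate years, ignoring leap days
years <- days / 365
years
```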
We can see that most deprecated packages last less than a year (I would say around two releases) and some stay in Bioconductor up to 6 years before being removed. Not surprisingly, no package incorporated before 2009 has been removed from the repository. But how do the packages that were not removed do in Bioconductor?
We can start comparing the number of downloads to how many IPs download each package.
pd <- position_dodge(0.1)
ggplot(stats[, .(Number = mean(Nb_of_downloads/Nb_of_distinct_IPs),
sem = sd(Nb_of_downloads/Nb_of_distinct_IPs)/sqrt(.N)),
by = c("Date")], aes(Date, Number)) +
geom_point() +
geom_errorbar(aes(ymin = Number - sem, ymax = Number + sem),
width = .1, position = pd) +
geom_line() +
theme_bw +
ggtitle("Downloads per IP") +
ylab("Mean downloads per IP") +
xlab("") +
theme +
scal
Figure 8: Downloads per IP
The error bars indicate the standard error of the mean.
We can see that the number of downloads per IP is usually around 2, but that there is much variation between packages. In the points marked in red the variation is bigger than the mean; this might be due to specific packages being downloaded mostly from the same IP:
ratio <- stats[, .(Mean = mean(Nb_of_downloads/Nb_of_distinct_IPs),
sem = sd(Nb_of_downloads/Nb_of_distinct_IPs)/sqrt(.N),
sd = sd(Nb_of_downloads/Nb_of_distinct_IPs),
max = max(Nb_of_downloads/Nb_of_distinct_IPs),
min = min(Nb_of_downloads/Nb_of_distinct_IPs)),
by = c("Package")]
ratio <- ratio[order(Mean, decreasing = TRUE), ]
ratio$Package <- as.character(ratio$Package)
ratio
## Package Mean sem sd max min
## 1: flowCore 37.495046 8.082629 81.630551 385.518002 1.424051
## 2: phosphonormalizer 7.174051 1.565667 4.428375 13.500000 1.500000
## 3: cummeRbund 7.074995 1.462331 12.147034 70.841699 1.583333
## 4: REMPdata 6.000000 NA NA 6.000000 6.000000
## 5: mosaics 5.884932 1.938655 16.900803 93.465831 1.096774
## 6: EpicopyData 5.666667 NA NA 5.666667 5.666667
## 7: cellscape 5.454799 1.249310 2.793542 9.562500 3.013514
## 8: mzR 5.182608 1.138364 9.592029 49.054572 1.000000
## 9: topOnto.HDO.db 5.000000 NA NA 5.000000 5.000000
## 10: minfi 4.673779 2.020262 16.781555 125.405208 1.479730
## ---
## 1677: IntramiRExploreR 1.000000 0.000000 0.000000 1.000000 1.000000
## 1678: MetCleaning 1.000000 NA NA 1.000000 1.000000
## 1679: OPWeight 1.000000 NA NA 1.000000 1.000000
## 1680: Rhdf5lib 1.000000 0.000000 0.000000 1.000000 1.000000
## 1681: Trumpet 1.000000 0.000000 0.000000 1.000000 1.000000
## 1682: cnAnalysis450k 1.000000 NA NA 1.000000 1.000000
## 1683: coexnet 1.000000 NA NA 1.000000 1.000000
## 1684: cytofWorkflow 1.000000 NA NA 1.000000 1.000000
## 1685: exomecopy 1.000000 NA NA 1.000000 1.000000
## 1686: miRBaseConverter 1.000000 NA NA 1.000000 1.000000
We can see that the package with the most downloads from the same IP is flowCore, followed by phosphonormalizer, cummeRbund and, in fourth place, REMPdata. Some packages (118) have been downloaded each time from a different IP. There are 50 packages with more dispersion than mean downloads per IP, which suggests that these packages are downloaded heavily from a few specific places.
I am curious about how the default packages of Bioconductor are downloaded; let’s see where they are:
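The code that produced these two counts is not shown. One way they could be obtained from a table shaped like `ratio` is sketched below on made-up data (the table `toy` and both conditions are assumptions about how the counts were defined):

```r
library(data.table)

toy <- data.table(
  Package = rep(c("pkgA", "pkgB", "pkgC"), each = 3),
  Nb_of_downloads    = c(10, 400, 12,  5, 6, 7,  2, 3, 1),
  Nb_of_distinct_IPs = c( 5,   4,  6,  4, 4, 5,  2, 3, 1)
)

ratio <- toy[, .(Mean = mean(Nb_of_downloads / Nb_of_distinct_IPs),
                 sd   = sd(Nb_of_downloads / Nb_of_distinct_IPs),
                 max  = max(Nb_of_downloads / Nb_of_distinct_IPs)),
             by = Package]

# Packages whose every download came from a different IP (ratio always 1)
always_distinct <- ratio[max == 1, .N]

# Packages whose dispersion exceeds their mean ratio
# (sd is NA for packages with a single month, so those are skipped)
more_sd_than_mean <- ratio[!is.na(sd) & sd > Mean, .N]
```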
ratio[Package %in% bioc_packages, ]
## Package Mean sem sd max min
## 1: BiocInstaller 2.265981 0.13800574 1.1791215 8.487406 1.190476
## 2: Biobase 2.140476 0.11760358 1.1877379 11.021502 1.632311
## 3: AnnotationDbi 1.877154 0.04585692 0.4631322 5.708393 1.454985
## 4: IRanges 1.775943 0.02636964 0.2663203 3.781343 1.408883
## 5: S4Vectors 1.671576 0.02352325 0.1469027 2.014107 1.406886
## 6: BiocGenerics 1.585254 0.01258922 0.1038133 2.075613 1.329412
BiocInstaller is the base package with the most downloads per IP, maybe because it is necessary to install the other packages in Bioconductor.
Now we explore whether there are seasonal cycles in the downloads, as figure 2 seems to show some cycles.
First we can explore the number of IPs per month downloading each package:
ggplot(stats, aes(Date, Nb_of_distinct_IPs, col = Package)) +
geom_line() +
theme_bw +
ggtitle("IPs") +
ylab("Distinct IP downloads") +
scal +
theme +
guides(col = FALSE)
Figure 9: Distinct IP per package
As we can see, there are two groups of packages in 2009: some with a low number of IPs and some with a larger number. As time progresses, the number of distinct IPs increases for some packages. But is the spread in IPs associated with an increase in downloads?
ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Package Downloads") +
ylab("Downloads") +
scal +
theme +
guides(col = FALSE)
Figure 10: Downloads per year
Surprisingly, some packages have a big burst up to 400k downloads, others up to just 100k downloads. But let’s focus on the lower end:
ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Downloads per package") +
ylab("Downloads") +
scal +
ylim(0, 50000)+
theme +
guides(col = FALSE)
## Warning: Removed 28 rows containing missing values (geom_path).
Figure 11: Downloads per year
There are many packages close to 0 downloads each month, and most packages have fewer than 10000 downloads per month:
ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) +
geom_line() +
theme_bw+
ggtitle("Downloads per package") +
ylab("Downloads") +
scal +
ylim(0, 10000)+
theme +
guides(col = FALSE)
## Warning: Removed 824 rows containing missing values (geom_path).
Figure 12: Downloads per year
As we can see, in general the month of the year also influences the number of downloads. So from 2010 onwards the factors influencing the downloads are the year and the month.
Maybe there is a relationship between the downloads and the number of IPs per date:
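The effect of the month could be made more explicit by dividing each value by the mean of its year, which removes the growth between years, and then averaging by calendar month. The following sketch uses an invented table (`toy`, with made-up download numbers) and is not the analysis that produced the figures:

```r
library(data.table)

toy <- data.table(
  Year  = rep(c("2015", "2016"), each = 4),
  Month = rep(c("01", "04", "08", "11"), times = 2),
  Nb_of_downloads = c(100, 150, 60, 140, 200, 260, 110, 250)
)

# Divide by the yearly mean so the growth between years does not mask the cycle
toy[, Relative := Nb_of_downloads / mean(Nb_of_downloads), by = Year]

# Average relative level of each calendar month across years
seasonal <- toy[, .(Relative = mean(Relative)), by = Month][order(Month)]
seasonal
```

Months with a relative level well below 1 in every year would point to a seasonal dip.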
ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Spread of downloads") +
ylab("Downloads per IPs") +
scal +
theme +
guides(col = FALSE)
Figure 13: Ratio downloads per IP per package
We can see that some packages have occasional rises in downloads per IP. But at this scale we miss a lot of packages with small ratios:
ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) +
geom_line() +
theme_bw +
ggtitle("IPs") +
ylab("Ratio") +
scal +
theme +
guides(col = FALSE) +
ylim(1, 5)
Figure 14: Ratio downloads per IP per package
But most of the packages seem to be more or less constant and around 2.
We can check whether the same packages have consistently been the most downloaded packages of Bioconductor:
norm <- stats[, {maxD = sum(Nb_of_downloads)
.SD[, .(Downloads = Nb_of_downloads/maxD), by = Package]}, by = Date]
norm <- norm[order(Date, Downloads), ] # Sort by date, then by share of downloads
ggplot(norm[, .(Package, rank = rank(Downloads)), by = Date], aes(Date, rank, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Rank of packages by downloads") +
xlab("") +
ylab("Rank of the package") +
guides(col = FALSE) +
scal +
theme
Figure 15: Position of packages in Bioconductor
We can observe the increase in the number of packages downloaded, especially in January 2006. The top packages remain more or less the same, while the less downloaded packages usually remain so. If we try to follow the evolution of all of them, it is impossible to distinguish them:
norm2 <- norm[, .(Package, rank = rank(Downloads)/.N), by = Date]
ggplot(norm2, aes(Date, rank, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Relative rank of packages by downloads") +
xlab("") +
ylab("Position by downloads") +
guides(col = FALSE) +
scal +
theme
Figure 16: % of packages in Bioconductor
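To make the difference between the plain rank and the relative rank concrete, here is a small sketch on a made-up table (`toy`, with character dates for brevity). Dividing by `.N` puts every month on the same 0 to 1 scale even when the number of packages changes:

```r
library(data.table)

toy <- data.table(
  Date = rep(c("2016-01", "2016-02"), times = c(2, 3)),
  Package = c("pkgA", "pkgB", "pkgA", "pkgB", "pkgC"),
  Nb_of_downloads = c(50, 10, 40, 30, 90)
)

# Rank within each month, and the same rank divided by the number
# of packages downloaded that month
ranks <- toy[, .(Package,
                 rank = rank(Nb_of_downloads),
                 relative = rank(Nb_of_downloads) / .N),
             by = Date]
ranks
```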
Only if we select a few packages to follow can we track them:
packages <- c("limma", "GOSemSim", "BioCor", "Clonality", "Prostar",
"rintact", "bioassayR", "DESeq", "DESeq2", "edgeR")
ggplot(norm2[Package %in% packages, ], aes(Date, rank, col = Package))+
geom_line() +
theme_bw +
ggtitle("Relative rank of packages by downloads") +
xlab("") +
ylab("Position by downloads") +
scal +
theme
Figure 17: Evolution of downloads of packages in Bioconductor
We can also observe when a package reached its maximum number of downloads:
norm3 <- stats[, {maxD = sum(Nb_of_downloads)
.SD[, .(Downloads = Nb_of_downloads/maxD), by = Date]}, by = Package]
norm4 <- norm3[, .(Date, rank = Downloads/max(Downloads)), by = Package]
ggplot(norm4, aes(Date, rank, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Growth of the packages") +
xlab("") +
ylab("Downloads/max(Downloads)") +
guides(col = FALSE) +
scal +
theme
Figure 18: Cycle of packages
Position of package downloads relative to the maximum downloads of each package over time.
As usual, we need to focus on fewer packages to be able to distinguish them:
ggplot(norm4[Package %in% c(packages, "RTools4TB", "SemSim"), ],
aes(Date, rank, col = Package)) +
geom_line() +
theme_bw +
ggtitle("Growth of the packages") +
xlab("") +
ylab("Downloads/max(Downloads)") +
scal +
theme
Figure 19: Cycle of few packages
Position of package downloads relative to the maximum downloads of each package over time.
As expected, the packages that keep up with the growth of Bioconductor have a peak near the end of the series. For this reason I added the packages SemSim and RTools4TB, to show packages that have been downloaded less and less.
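The month in which each package peaks can also be extracted directly rather than read from the plot. A minimal sketch on invented data (the packages `growing` and `fading` and their numbers are hypothetical):

```r
library(data.table)

toy <- data.table(
  Package = rep(c("growing", "fading"), each = 4),
  Date = rep(as.Date(c("2014-01-01", "2015-01-01",
                       "2016-01-01", "2017-01-01")), times = 2),
  Nb_of_downloads = c(10, 40, 90, 160, 120, 200, 60, 15)
)

# Scale each package by its own maximum, then keep the date of the peak
toy[, Scaled := Nb_of_downloads / max(Nb_of_downloads), by = Package]
peak <- toy[, .(Peak = Date[which.max(Scaled)]), by = Package]
peak
```

A peak at the end of the series suggests a package that is still growing, while an early peak suggests one that is fading.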
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.10.4 ggplot2_2.2.1 BiocStyle_2.4.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.10 knitr_1.15.1 magrittr_1.5 munsell_0.4.3
## [5] colorspace_1.3-2 stringr_1.2.0 highr_0.6 plyr_1.8.4
## [9] tools_3.4.0 grid_3.4.0 gtable_0.2.0 htmltools_0.3.6
## [13] yaml_2.1.14 lazyeval_0.2.0 rprojroot_1.2 digest_0.6.12
## [17] tibble_1.3.0 bookdown_0.3 evaluate_0.10 rmarkdown_1.5
## [21] labeling_0.3 stringi_1.1.5 compiler_3.4.0 scales_0.4.1
## [25] backports_1.0.5